The goal of this assignment is to implement neural style transfer. We use a pretrained VGG19 model to extract features from the content and style images. We then optimize the input image to minimize the content loss and style loss.
After this, I implemented several "bells and whistles" improvements:
For this task, I implemented a simple reconstruction loss between an image and its reconstruction using MSE loss. The loss is computed on the feature maps after layers the user can specify. I experimented with placing the reconstruction loss after different layers to see how the choice of layer affects the reconstruction.
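To make the setup concrete, here is a minimal PyTorch sketch of the kind of content-loss module I am describing; the class name `ContentLoss` and its placement logic are illustrative rather than a copy of my exact implementation. The module is inserted after a user-chosen VGG19 layer, records the MSE against the fixed content features, and passes the activations through unchanged.

```python
import torch
import torch.nn as nn

class ContentLoss(nn.Module):
    """Records the MSE between the current features and fixed content features.

    Meant to be inserted after a chosen VGG19 layer. It acts as a transparent
    pass-through so later layers still receive the unmodified activations.
    """
    def __init__(self, target):
        super().__init__()
        self.target = target.detach()   # content features; no gradient needed
        self.loss = torch.tensor(0.0)

    def forward(self, x):
        self.loss = nn.functional.mse_loss(x, self.target)
        return x  # pass features through untouched
```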
We can see here that the reconstructions get progressively noisier as we move to deeper layers. This makes sense: high-frequency details are preserved in earlier layers, while low-frequency, semantically meaningful qualities are preserved in later layers, and the reconstructions become quite noisy/blurry along the way. Using multiple layers provides a mix of behaviors, but I believe a single earlier layer is sufficient quality-wise while also reducing complexity. I do not want to choose a layer that is too early in the model, though; otherwise we preserve too much of the high-frequency detail that we want replaced with a different style later.
As you can see, there is very little difference between the outputs from different noise initializations. There are some hazy dots in the reconstruction difference (the reconstructed image subtracted from the original content image), but they are not very noticeable. This shows that reconstructions from different noise initializations can get very close to the original content image.
The style loss is the MSE between the Gram matrix of the style image's features and the Gram matrix of the input image's features. We minimize this so that the style of the input image matches the style of the style image.
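Here is a sketch of the Gram matrix and style loss in the same style as the content-loss module above; again the names are illustrative. I normalize the Gram matrix by the number of feature elements so deeper, larger layers do not dominate the loss.

```python
import torch
import torch.nn as nn

def gram_matrix(features):
    """Normalized Gram matrix of a (batch, channels, height, width) feature map."""
    b, c, h, w = features.size()
    flat = features.view(b * c, h * w)
    return flat @ flat.t() / (b * c * h * w)

class StyleLoss(nn.Module):
    """MSE between the Gram matrix of the input features and the style target."""
    def __init__(self, target_features):
        super().__init__()
        self.target_gram = gram_matrix(target_features).detach()
        self.loss = torch.tensor(0.0)

    def forward(self, x):
        self.loss = nn.functional.mse_loss(gram_matrix(x), self.target_gram)
        return x  # pass-through, like the content loss
```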
Here, we can see that earlier layers give better-quality textures (high-frequency details are sharper), but global structure from the style image is not preserved. As we move to later layers, the global structure is better preserved and begins to show up, but the textures and smaller details become quite distorted. I preferred a mix of the two, such as style loss at conv layers 2, 4, 6, 8, and 10.
Again, different noise initializations produce outputs that are quite similar. That said, textures/styles are by nature much higher frequency than content, so the difference is more significant, as we can see when we subtract the two. Interestingly, the difference between the two synthesized textures looks quite nice in its own right.
As we learned in class, the style transfer loss uses a Gram matrix to represent the style of the image in either feature space or pixel space. I mainly used feature space: the input image is optimized so that the Gram matrices of its features, taken after layers the user can select, match those of the style image.
For each content image and style, some slight hyperparameter tuning is needed to get better outputs. However, I found that there was a good balance at:
For the most part, I only needed to adjust the style weight to make the style influence heavier or lighter depending on the complexity of the content image. If the content image (like dancing) is easily overwhelmed by the style image (like starry night), then it is a good idea to reduce the style_weight to 1,000. For lighter styles, like picasso, the style_weight can be increased to 100,000 or 1,000,000. We will also see later that the output quality depends on the initialization approach; some hyperparameters only work well with noise initialization while others work well with content initialization.
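The sketch below shows how the two weights enter the optimization loop, assuming the `ContentLoss`/`StyleLoss` modules above have already been inserted into a truncated VGG19 (`model`). The function name and the default weights are placeholders, not my tuned values.

```python
import torch

def run_style_transfer(model, content_losses, style_losses, input_img,
                       num_steps=300, style_weight=1e4, content_weight=1.0):
    """Optimize the pixels of input_img so that the weighted sum of the
    content and style losses collected by the inserted modules is minimized."""
    input_img.requires_grad_(True)
    model.eval().requires_grad_(False)
    optimizer = torch.optim.LBFGS([input_img])

    step = [0]
    while step[0] < num_steps:
        def closure():
            with torch.no_grad():
                input_img.clamp_(0, 1)      # keep pixels in a valid range
            optimizer.zero_grad()
            model(input_img)                # populates .loss on each inserted module
            style_score = style_weight * sum(sl.loss for sl in style_losses)
            content_score = content_weight * sum(cl.loss for cl in content_losses)
            total = style_score + content_score
            total.backward()
            step[0] += 1
            return total
        optimizer.step(closure)

    with torch.no_grad():
        input_img.clamp_(0, 1)
    return input_img
```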
Time when initializing with random noise: 720.03 seconds
Time when initializing with content image: 718.33 seconds
The times are extremely close, so there is not much of a difference, though content-image initialization is consistently slightly faster. It makes sense that initializing with the content image should speed up the optimization, since we start closer to the optimal solution than with random noise. However, the number of optimization steps is hardcoded, so a closer starting point does not reduce the number of optimization operations. Additionally, when printing the loss, we see it drops below 1 within a few iterations; after that, the optimization behaves very similarly for noise and content initialization. If I implemented early stopping that halted iterations once the loss converged, we would probably see a much faster time for content-image initialization.
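I did not implement this, but as a rough sketch, a plateau check like the following could be called once per recorded loss value to break out of the optimization loop early; the window size and tolerance are arbitrary placeholders.

```python
def has_converged(loss_history, window=10, tol=1e-3):
    """Return True once the loss has stopped improving by more than `tol`
    over the last `window` recorded values."""
    if len(loss_history) <= window:
        return False
    recent_best = min(loss_history[-window:])
    earlier_best = min(loss_history[:-window])
    return earlier_best - recent_best < tol
```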
Quality-wise, the results are much more faithful to the original content image (color intensities are closer) with content initialization, though both are quite good. In some experiments I also saw that style transfer fails to converge with noise initialization, likely because the optimization gets stuck in a different local minimum.
Video style transfer took quite a bit of work. First, I found an online repository that tackles a similar problem, but their model learns style transfer with a loss function that encourages temporal smoothness. The end result is similar to what we want, but it does not accomplish this through optimization, which is what we want to do.
I used their pretrained SPyNet model to calculate the optical flow between frames. The input image, warped according to the optical flow to the next frame, is then used as an additional loss term in the optimization problem. The results (left) were not very good. After investigating, I found that the estimated optical flow looked fairly random, which meant the optimization did not have a representative loss function to work with.
As a result, I switched to the more powerful DeepFlow2 to calculate the optical flow. With this, the style stabilized to a much stronger degree. That said, there are a couple of frames where the optical flow appears to be calculated incorrectly, and the result is a radical change in style, which causes the flashes you see. It should be noted that videos are very computationally expensive, so I used 128x128 frames, only 4 seconds of video, and reduced the optimization to 5 iterations per frame. Even with these constraints, the results seem decent. The video I used is of someone walking around London, pulled from YouTube.
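For reference, here is a hedged sketch of the temporal term described above: the previous stylized frame is warped by the estimated flow, and the current frame is penalized for deviating from it. I am assuming a backward flow in pixel units with x/y displacement channels; the function names and the bilinear-warping details are my own, not the repository's.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(image, flow):
    """Warp a (1, C, H, W) image with a (1, 2, H, W) pixel-space flow
    (channel 0 = x displacement, channel 1 = y displacement)."""
    _, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(image.device)
    coords = base + flow
    # normalize displaced coordinates to [-1, 1] for grid_sample
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)
    return F.grid_sample(image, grid, align_corners=True)

def temporal_loss(current_stylized, prev_stylized, flow, weight=1.0):
    """Penalize the current stylized frame for deviating from the previous
    stylized frame warped into the current frame by the optical flow."""
    warped_prev = warp_with_flow(prev_stylized, flow).detach()
    return weight * F.mse_loss(current_stylized, warped_prev)
```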
I first applied a naive approach, in which I simply masked out the content image at each iteration; it honestly looks quite decent. For comparison, I also implemented the approach from Controlling Perceptual Factors in Neural Style Transfer, which finds the activations that a mask corresponds to and uses them to compute the Gram matrix. This approach has a more involved loss function, but in my opinion the results end up unsatisfactory. I believe the reason is that the cropped sky is a uniform blue, so it does not have many characteristics for a style to transfer onto, resulting in a fairly bland pattern. That said, the edge of the mask is then able to contribute to the style transfer, and while it is not totally noticeable in this example, there tends to be a gradient from the edge of the mask.
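One way to approximate the activation masking described in the paper is to resize the image-space mask to each layer's feature resolution, weight the activations by it, and normalize the Gram matrix by the masked area. The sketch below uses my own names and is not a line-for-line copy of the paper's formulation.

```python
import torch
import torch.nn.functional as F

def masked_gram(features, mask):
    """Gram matrix computed over a masked region of a (1, C, H, W) feature map.

    `mask` is a (1, 1, H0, W0) binary mask in image space; it is resized to the
    feature resolution and the result is normalized by the masked area.
    """
    _, c, h, w = features.shape
    m = F.interpolate(mask, size=(h, w), mode="nearest")
    weighted = features * m              # zero out activations outside the mask
    flat = weighted.view(c, h * w)
    denom = c * m.sum().clamp(min=1.0)   # normalize by masked area, avoid div by 0
    return flat @ flat.t() / denom
```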